Skip to content

GH-50162: [C++][Parquet] Avoid int32 overflow in BitPackedRunDecoder::GetBatch offset#50089

Open
metsw24-max wants to merge 2 commits into
apache:mainfrom
metsw24-max:rle-bitpacked-offset-overflow
Open

GH-50162: [C++][Parquet] Avoid int32 overflow in BitPackedRunDecoder::GetBatch offset#50089
metsw24-max wants to merge 2 commits into
apache:mainfrom
metsw24-max:rle-bitpacked-offset-overflow

Conversation

@metsw24-max

@metsw24-max metsw24-max commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

int32 overflow in the bit-packed run decoder offset
GetBatch works out the byte position with values_read_ * value_bit_width in 32-bit int. For a large bit-packed run (this decodes untrusted parquet RLE/bit-packed dictionary indices and levels, with value width up to 64) the product passes INT32_MAX and wraps negative, so bytes_fully_read goes negative and unread_data ends up before the buffer, giving an out of bounds read in unpack. raw_data_size just above already widens to int64 before the same multiply, so I matched that here.

@github-actions github-actions Bot added the awaiting review Awaiting review label Jun 4, 2026
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@pitrou

pitrou commented Jun 10, 2026

Copy link
Copy Markdown
Member

@metsw24-max Thanks for submitting this PR! Can you open a corresponding GitHub issue as instructed by the comment bot above?

@AntoinePrv Would you like to review these changes?

@AntoinePrv AntoinePrv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ideally we would want to check if the untrusted int is within the bounds of the parquet specs, and if not error (here it may be return negative value?).

Though is the bit_width check done before? if not, should it?

Comment on lines +381 to +382
/* .bit_offset= */ static_cast<int>(bits_read % 8),
/* .max_read_bytes= */ static_cast<int>(max_read_bytes_ - bytes_fully_read),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the same way that it may previously not fit in an int do we know that it will fit it at this point (and not turn negative)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For parser-produced runs it does hold: PeekImpl only emits a run whose whole payload fits in the remaining buffer (it truncates or rejects the run otherwise), and the BitPackedRun constructor DCHECKs that invariant. Since values_read_ <= values_count_, bytes_fully_read <= max_read_bytes_, so the difference stays in [0, max_read_bytes_], which is itself an rle_size_t. The remaining case is the negative sentinel (no bound), where the difference stays negative and unpack treats any negative value the same as -1. Added a DCHECK at the subtraction site to make that explicit.

@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 11, 2026
@metsw24-max metsw24-max changed the title avoid int32 overflow in BitPackedRunDecoder::GetBatch offset GH-50162: [C++][Parquet] Avoid int32 overflow in BitPackedRunDecoder::GetBatch offset Jun 12, 2026
@github-actions

Copy link
Copy Markdown

⚠️ GitHub issue #50162 has been automatically assigned in GitHub to PR creator.

@metsw24-max metsw24-max requested a review from pitrou as a code owner June 12, 2026 12:34
@metsw24-max

Copy link
Copy Markdown
Contributor Author

@pitrou done, opened GH-50162 and renamed the title to match.

@AntoinePrv on the spec bounds: the bit width is already checked before it reaches this decoder. Dictionary indices reject anything above 32 in DictDecoderImpl::SetData, and for rep/def levels the width isn't read from the file at all, it's derived from max_level. Run lengths are bounded by the parser, which truncates or rejects a run that would overflow the buffer. The catch is that the values overflowing here are all within spec: a single bit-packed run can validly hold close to 2^31 values, so values_read_ * value_bit_width passes INT32_MAX on legitimate data once a run grows past 256 MiB. So I don't think there's an out-of-spec value to error on at this level; the intermediate just needs the wider type, same as raw_data_size above. Happy to add an explicit error path if you'd rather have one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants